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A systematic study of gene expression variation 
at single-nucleotide resolution reveals widespread 
regulatory roles for uAUGs 
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Regulatory single-nucleotide polymorphisms (rSNPs) alter gene expression. Common approaches for identifying rSNPs 
focus on sequence variants in conserved regions; however, it is unknown what fraction of rSNPs is undetectable using this 
approach. We present a systematic analysis of gene expression variation at the single-nucleotide level in the Saccharomyces 
cerevisiae GAL1-10 regulatory region. We exhaustively mutated nearly every base and measured the expression of each 
variant with a sensitive dual reporter assay. We observed an expression change for 7% [43/582) of the bases in this region, 
most of which [35/43, 81%) reside in conserved positions. The most dramatic changes were caused by variants that 
produced AUGs upstream of the translation start (uAUGs), and we sought to understand the consequences and molecular 
mechanisms underlying this class of mutations. A genome-wide analysis showed that genes with uAUGs display signifi- 
cantly lower mRNA and protein levels than genes without uAUGs. To determine the generality of this mechanism, we 
introduced uAUGs into S. cerevisiae genes and observed significantly reduced expression in 17/21 instances [p < 0.01), 
suggesting that uAUGs are functional in a wide variety of sequence contexts. Quantification of mRNA and protein levels 
for uAUG mutants showed that uAUGs affect both transcription and translation. Expression of uAUG mutants under the 
upflA strain demonstrated that uAUGs stimulate the nonsense-mediated decay pathway. Our results suggest that uAUGs 
are potent and widespread regulators of gene expression that act by attenuating both protein and RNA levels. 

[Supplemental material is available for this article.] 



Regulatory single-nucleotide polymorphisms (rSNPs) have gar- 
nered much attention in recent biomedical studies. Evidence has 
revealed that rSNPs contribute to human phenotypic variation and 
can affect disease susceptibility. Furthermore, many disease-asso- 
ciated SNPs identified in genome-wide association studies (GWAS) 
are noncoding and are most likely regulatory in nature (Hindorff 
et al. 2009). However, the identification of rSNPs remains chal- 
lenging. Many researchers have applied computational methods 
to distinguish functional rSNPs from a large number of neutral 
noncoding variations, mostly focusing on SNPs in conserved re- 
gions. While such approaches have identified many functional 
regulatory regions, it is not clear whether they can identify the 
majority of regulatory elements. For example, a recent study ana- 
lyzed transcription factor binding sites in five different vertebrates 
and found that most binding events were species-specific. In fact, 
for one of their transcription factors, CEPBA, only 0.3% of binding 
sites were conserved across all five species (Schmidt et al. 2010). 
Transcription factor binding site (TFBS) "turnover" and sequence 
mutation of binding sites are two mechanisms that may explain 
this high degree of species-specific binding (Odom et al. 2007; 
Schmidt et al. 2010). These results raise an important question: 
How sensitive are alignment-based conservation approaches in 
predicting regulatory elements? More specifically, what fraction of 
nucleotides regulating transcription or translation reside in con- 
served noncoding sequences? 
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In this study, we used the GAL1-10 regulatory region of yeast 
Saccharomyces cerevisiae as a model system and identified single 
nucleotides that affect gene expression in the region of 630 bases 
upstream of the GAL1 translation start site. We created a library of 
single point mutations covering 582 unique nucleotide positions 
in this region. Using a dual-color (CFP/YFP) reporter system (Elowitz 
et al. 2002; Raser and O'Shea 2004), we detected in vivo gene 
expression changes as small as 10%. This nearly exhaustive and 
uniformly distributed mutation library coupled with our sensitive 
detection assay allowed us to quantitatively study the effect of 
rSNPs in relationship with sequence conservation, TFBSs, and 
other functional sequence features. We identified 43 positions in 
the GAL1-10 regulatory region that reduced reporter gene expres- 
sion by >10% upon mutation. The observed changes in expression 
ranged from small perturbations to complete abolishment of re- 
porter gene expression. The majority of mutations affecting gene 
expression occurred in bases that are conserved, supporting the 
canonical view that conservation is a powerful predictor of func- 
tion. For mutations within a binding site, we demonstrated that 
the in vivo expression change correlated with binding energy 
change predicted by the PWM of the Gal4 transcription factor. 

We identified several mutations in our library that caused 
much larger expression changes than those in known TFBSs. These 
mutations produced frameshift uAUGs and completely silenced 
expression. It has previously been shown that uAUGs have strong 
effects on gene expression in both yeast and humans (Calvo et al. 
2009; Hood et al. 2009), so we sought to further understand the 
mechanism by which this widespread and potent class of muta- 
tion affects gene expression. By performing a genome- wide analysis 
of uAUG sites in S. cerevisiae, we found a strong correlation between 
the reduction of gene expression and the existence of uAUG sites. 



22:1089-1097 © 2012, Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/12; www.genome.org 



Genome Research 1089 

www.genome.org 



Yun et al. 



We observed that —21% of genes in 5. cerevisiae have preserved 
uAUG sites and are highly conserved among yeast species. To in- 
vestigate the scope and the strength of uAUG ds-regulation in the 
yeast genome, we introduced the uAUG mutation into 21 randomly 
selected genes without native uAUGs. In 80% of the examined 
genes, the introduction of a uAUG significantly reduced gene ex- 
pression. This effect was independent of the trinucleotide sequence 
context and other gene-specific features. Furthermore, we quantified 
the reduction of mRNA and protein in uAUG mutants for five genes 
and found that uAUGs exert both transcriptional and translational 
control. Finally, by analyzing these uAUG mutants in a nonsense- 
mediated decay (NMD) deficient yeast strain, we demonstrated that 
the NMD pathway plays a pivotal role in the degradation of mRNA 
transcripts caused by uAUG defects. Collectively, these results suggest 
that the creation or destruction of a uAUG site by a single nucleotide 
substitution is an rSNP of strong impact and might be a widespread 
feature in shaping the functional regulatory network in yeast. 

Results 

Creation of a nearly exhaustive single-nucleotide 
mutation library 

We applied two mutagenesis methods to create a single-nucleotide 
mutation library for the GAL1-10 regulatory region: error-prone 
PCR and site-directed mutagenesis. To avoid over-representation of 
mutations at certain positions that were introduced by early PCR 
cycles, we modified the standard PCR mutagenesis protocol by 
performing linear template amplification, rather than exponential 
amplification (see Methods). We cloned the library into a pYlOTY 
plasmid, transformed the plasmid into Escherichia coli, and iden- 
tified mutations by Sanger sequencing. In total, 2764 constructs 
were found to contain one to seven nucleotide substitutions, and 
65% of these constructs contained only a single nucleotide change. 
The average mutation rate of the library was 0.76 mutations per 
construct, a value that was close to the target mutation rate at one 
nucleotide per construct (Supplemental Fig. SI). The library cov- 
ered 615 of the 630 nucleotide positions (98%) in the GAL1 regu- 
latory region, and 533 nucleotide positions were covered by con- 
structs containing only one mutation, indicating that our method 
was sufficient to create a highly enriched single-nucleotide muta- 
tion library (Supplemental Fig. S2). To increase the coverage of the 
mutation library and to validate the constructs generated by the 
above method, we also created constructs with an additional 140 
single-nucleotide mutations by site-directed mutagenesis; some of 
the mutations overlapped with the previous set. 

We selected one construct per nucleotide position and trans- 
formed them individually into yeast cells. In total, 582 constructs 
were chosen for expression analysis. Each construct contained a 
single-nucleotide mutation at a unique position in the GAL1-10 
regulatory region. Among 582 mutated bases, 191 (33%) are tran- 
sitions, and 391 (67%) are transversions (for a summary of the 
mutation spectrum, see Supplemental Tables SI, S2). 

Detection of gene expression changes in GAL1-10 
regulatory variants 

Since small changes in gene expression can have important func- 
tional consequences, it is important to develop a highly sensitive 
assay to detect expression change caused by a single-nucleotide 
mutant. We implemented a dual-color reporter assay in which 
a mutant GAL1 construct drives a YFP reporter gene, and a wild- type 



GAL1 construct drives a CFP reporter gene (Fig. 1A). The CFP re- 
porter acts as an internal control to eliminate the extrinsic noise 
from experimental measurements, estimated at 97% of the total 
noise (Raser and O'Shea 2004). 

We individually transformed 582 constructs into yeast cells to 
create haploid strains that expressed YFP. Each strain was mated to 
a haploid cell expressing CFP under the control of a wild-type 
GAL1 regulatory construct. The resultant diploid strains expressed 
both CFP and YFP proteins at the homologous loci on sister chro- 
mosomes. We then used flow cytometry to measure the ratio of 
YFP to CFP in about 15,000 cells for each strain under the galactose- 
induction condition. The YFP-to-CFP ratio reports the mutation's 
effect on gene expression relative to the wild-type construct (Fig. IB). 
We measured reporter gene expression in six independent trans- 
formants for each member of the mutation library, resulting in 
a total of 3492 measurements. 

By combining this dual-color system and individual cell 
measurements, we achieved highly sensitive gene expression de- 
tection. Our analysis showed that we can reliably detect a 10% 




CFP Fluorescence Units 

Figure 1. Overview of detecting expression variation for the mutation 
library. (A) Design of dual-color reporter system; (red bar) single nucleotide 
mutation; (WT) wild type; (MUT) mutant. (B) An example of determining 
the expression level (mutant strain Mut C352A, position 352, C — > A mu- 
tation). Each dot represents the CFP and YFP fluorescence intensities from 
one cell. (Red dots) Cells from a wild-type diploid strain carrying a gaP-YFP 
and gaP-CFP fusion. The slope represents the mean of the YFP-versus-CFP 
ratio for a population of cells (normalized, k = 1 ). (Green dots) Cells from the 
Mut C352A strain carrying a gal'-YFP and gaP-CFP fusion. The slope rep- 
resents the mean for the mutant population (k = 0.58). The relative ratio 
between two slopes indicates the expression variation. 



1090 Genome Research 



www.genome.org 



Conservation, expression variation, and uAUG 



reduction in gene expression at a false discovery rate (FDR) of 
<0.005. Due to transformational variation, there was less power to 
detect increases in expression (see Supplemental Fig. S3; Supple- 
mental Material I, II). In total, we identified 43 mutants that caused 
a significant reduction in gene expression (Fig. 2; Supplemental 
Fig. S4; Supplemental Table S3; Supplemental Material III, IV). We 
found that the efficacy of the mutation diminished as a function of 
the distance from the GAL1 translation start site, with the strongest 
effects observed for mutations in the 5 ' UTR region, followed by 
those in the TATA-box and in the four Gal4 binding sites. 

The majority of single-nucleotide mutations that change gene 
expression reside within conserved regions 

We first asked if the mutations that reduce gene expression reside 
predominantly in conserved regions. We defined conserved bases 
in two ways: first, as the invariant bases in the sequence alignment 
of four yeast species (Kellis et al. 2003); and second, as the bases 
with significant PhastCon conservation scores. PhastCon (Siepel 
and Haussler 2004) is a program for identifying evolutionarily 
conserved elements from a multiple alignment, and it correctly 



accounts for the phylogenetic relationships between sequences. Of 
the 582 nucleotide positions analyzed, 245 nucleotides were iden- 
tical across four yeast species (Fig. 2), 160 nucleotides were defined 
as conserved with a PhastCon score >0.1 (Fig. 3), and 107 nucleo- 
tides were concordant by both methods. 

Of our 43 bases whose mutation causes significant changes in 
gene expression, 35 (81%) were located in the conserved regions as 
defined by the alignment method (hypergeometric P < 5.3 x 10 ~ 8 ), 
and 27 (63%) were defined as conserved by the PhastCon method 
(hypergeometric P< 1.3 X 10 ~ n ). These 2 7 positions also happen to 
be invariant in the alignment. Both comparisons showed that the 
majority of single-nucleotide mutations that change gene expres- 
sion reside within conserved regions, indicating that searching for 
rSNPs by focusing on conserved regions will likely capture a large 
fraction of, but not all, functional rSNPs. 

Many single-nucleotide mutations that change gene expression 
reside within TFBS 

We next sought to determine whether single-nucleotide mutations 
that change gene expression cluster solely within TFBSs or whether 
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Figure 2. Multiple sequence alignments among four yeast species in the GAL1- 10 regulatory region and gene expression of single-nucleotide mutated 
strains. The sequence alignment is shown according to Kellis etal. (2003). (Axis) The position from -1 to -630 before the GAL1 start codon. (*) Conserved 
positions among four species. (Sequences in blue) The mutated nucleotides. Validated Gal4 and TATA binding sites are boxed in red. Gene expression 
variations for a single nucleotide mutation are indicated with different colors along the position axis in the expression bar. The color bar indicates different 
expression levels (see color map). (Seer) 5. cerevisiae; (Spar) 5. paradoxus; (Smik) 5. mikatae; (Sbay) 5. bayanus; (Mut) mutated nucleotide. 
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Figure 3. Expression variation versus phastCons scores. phastCons scores are plotted in sliding 
windows along the GAL1-10 regulatory sequences, with /-axis labeling on the right (Gray peaks) The 
regions with pronounced signals of sequence conservation. The expression ratio (ER) for each nucleotide 
between YFP and CFP is plotted by colored dots, with /-axis on the left [(red) ER < 0.9; (blue) 0.9 < ER 
< 1 .2; (green) nucleotides with no mutation or expression data]. Below the graphic is a panel of known 
functional elements: (5UTR) 5' UTR; (T) TATA; (M) MIG1 TFBS; (G) Gal4 TFBS. 



some fraction of mutations are located outside of these sequences. 
For example, regulatory variants that act by disrupting nucleosome 
positioning and therefore affect gene expression may be distrib- 
uted throughout the regulatory region. The TATA-box and four 
Gal4 binding sites are known to up-regulate GAL1 under galactose 
induction (Giniger et al. 1985; Kellis et al. 2003). Of the 43 nu- 
cleotides whose mutation causes significant expression variation, 
27 (63%) were located in or adjacent to these sites, indicating that 
the majority of single-nucleotide mutations that change gene ex- 
pression reside within TFBS (Fig. 2). Also, nearly all changes in high 
information content positions of the PWM in these TFBSs signif- 
icantly affected expression, suggesting that TFBSs are enriched 
with rSNPs. In the TATA-box, we observed significant changes in 
all six examined positions, and three of these reduced gene ex- 
pression by >75%. 

Three of the four Gal4 binding sites match the 1 7-bp Gal4 
consensus sequence CGGN ( n)CCG (Fig. 2). The three nucleotides 
at either end of the consensus sequence have the highest in- 
formation content. We found 20 positions within the four Gal4 
binding sites that produced a significant change in gene expres- 
sion, 19 of which were at high information content positions. As 
expected, substitutions at the high information content positions 
in general caused larger changes in gene expression than sub- 
stitutions at other positions. Compared with the TATA-box, the 
expression changes caused by substitutions in Gal4 sites were 
small: The largest variation was a decrease of 32%, and the ma- 
jority of them showed a decrease of —15%. These relatively small 
effects may be explained by functional redundancy of the four 
coexisting Gal4 sites in one GAL1-10 regulatory region (Fig. 4A). 

We next examined how well the Gal4 PWM predicts changes 
in gene expression. We found a significant correlation (Pearson 
correlation, r = 0.58, P < 1.5 X 10 ~ 8 ) between the changes in the 
binding energy predicted by the PWM (Matys et al. 2003) and the 



changes in gene expression among Gal4 
binding sites (Fig. 4B). We also found that 
the changes in expression varied signifi- 
cantly between the different Gal4 binding 
sites, even for the mutations that occurred 
at the same position in the consensus site. 
Furthermore, within the six high infor- 
mation content bases, the change in bind- 
ing energy does not explain all the variance 
of expression change. This suggests that in 
addition to binding energy, other factors 
such as the location of the TFBS play a sig- 
nificant role in gene regulation (Fig. 4A). 

Creation of uAUG sites in the 5' UTR 
of the GAL1-10 regulatory region 
abolishes gene expression 

Sixteen of the positions that showed sig- 
nificant changes in gene expression upon 
mutation reside outside TFBSs and con- 
served regions. Surprisingly, mutations in 
these positions generated some of the 
strongest gene expression changes (Fig. 2; 
Supplemental Fig. S5). Nine of 16 posi- 
tions were located in the 5' UTR of the 
GAL1-10 regulatory region (between the 
major transcriptional start site at -62 and 
the AUG start codon) (Johnston and Davis 
1984). These substitutions may affect gene expression by perturbing 
either transcriptional or translational regulation. Most strikingly, 
three mutations — at positions 16, 36, and 45 — virtually abolished 
gene expression. Further analysis found that each of these mu- 
tations created a frameshift uAUG start codon in the 5 ' UTR re- 
gion. Each of these uAUGs created an upstream open reading 
frame (uORF), and all of these uORFs share the same termination 
codon, which overlaps with the first nucleotide of the canonical 
ATG start codon. Although all three of these mutations have large 
effects on gene expression, the nucleotides at these positions are 
variant across four yeast species; however, these sequences are 
probably still under some evolutionary constraint because no 
species contained a uAUG site at these positions. This example 
presents a nonconventional scenario of conservation, in that cer- 
tain mutations can have major effects (strong functional con- 
straint), but the bases at these positions show relaxed evolutionary 
constraint. 

In the wild-type GAL 1-10 regulatory region, there are no 
uAUGs. Nine positions in the 5' UTR could potentially mutate to 
a frameshift uAUG by a point mutation with a correct substitution 
type. The three positions listed above are the only frameshift uAUGs 
created in the library, and all strongly impact gene expression. The 
remaining six positions (positions at 15, 21, 38, 46, 56, and 61) were 
mutated to trinucleotides other than AUG, and no gene expression 
change was observed (Supplemental Fig. S5). These results led us to 
further investigate the genome-wide regulatory roles of uAUGs. 

uAUGs down-regulate both mRNA and protein expression 
levels genome-wide 

We sought to determine if the uAUGs distributed throughout 
the genome caused expression changes of the same magnitude as 
those we observed at the GAL1 locus. We compared the mRNA 
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Figure 4. Expression variations among four Gal4 binding sites in the GAL1-10 regulatory region. (A) 
The motif pattern of the Gal4 binding site is created based on the position weight matrix reported in 
TRANSFAC. Expression levels of each position in the Gal4 sites are shown with different colors to dis- 
criminate the degree of changes (see color map). (Green circles) Nonmutated positions. Error bars are 1 
standard deviation among different yeast transformants from the same mutant construction. (B) The 
correlation between gene expression and binding energy for the four Gal4 binding sites. The binding 
energy is proportional to the log [freq(Mut)/freq(Wild)] (Stormo 1998). (Red spots) The high infor- 
mation content sites of the Gal4 PWM; (black spots) the low information content sites. 



(Nagalakshmi et al. 2008) and protein levels (Ghaemmaghami et al. 
2003) between genes containing uAUGs and those without. Of 3042 
genes with available mRNA and protein expression data, the median 
of mRNA levels for single uAUG genes was —15% lower (Wilcoxon, 
P < 7.8 X 10" 17 ) than that of uAUG-free genes; the median of the 
protein levels for uAUG genes was ~2.2-fold lower (Wilcoxon, 
P < 4.8 X 10" 11 ) (Fig. 5A; Supplemental Material V). This suggests 
that uAUGs are involved in both transcriptional and translational 
regulation with a large impact at the translational level, an obser- 



vation that agrees with a similar analysis of 
the human genome (Calvo et al. 2009). 

Selection against uAUGs partially 
explains the large number 
of conserved bases in yeast 
5' UTR regions 

Previous studies (Iacono et al. 2005; Hood 
et al. 2009) have shown that uAUGs in 5' 
UTRs are under-represented. To confirm 
that purifying selection has indeed acted 
on the AUG trinucleotide and that the 
observed under-representation was not 
the result of selection against "AU" or 
"UG" di-nucleotides, we used a first-order 
Markov model to compute the expected 
uAUG frequency given the observed di- 
nucleotide frequencies. We calculated the 
expected frequency and compared it with 
the observed uAUG frequency in yeast 
5' UTR sequences defined by RNA-seq 
(Nagalakshmi et al. 2008). We found that 
the AUG trinucleotide was the least com- 
mon of the 64 possible trinucleotides, with 
the observed occurrence being 58% less 
than expected (simulation, P < 1 x 10 ~ 16 ) 
(Supplemental Fig. S6), confirming that 
the uAUG trinucleotide is indeed under 
strong purifying selection. 

We next asked if polymorphic uAUGs 
were less likely to be observed relative to 
the other trinucleotides in the 37 S. cer- 
evisiae strains that have been sequenced. In 
the 365,753 bases of 5' UTR sequences 
(Nagalakshmi et al. 2008), we observed 
10,073 single-nucleotide changes relative 
to the reference strain. Only 1387 of these 
created an uAUG site in at least one of the 
other strains, a number significantly less 
than the expected 1813 SNPs from simu- 
lation (P < 0.0001). Thus, this orthogonal 
analysis provides additional support to the 
hypothesis that the presence of uAUGs in 
5' UTRs is under strong purifying selection. 

It is known that 5' UTRs in yeast are 
highly conserved, despite the fact that very 
few functional sequence elements have 
been found in this region. We hypothe- 
sized that selection against the forma- 
tion of uAUGs might place evolutionary 
constraints on the 5' UTR and explain 
the high level of conservation that is ob- 
served. We estimated that —14% of the nucleotides in the 5' UTR of 
the 3499 non-uAUG genes of S. cerevisiae could be converted into 
AUGs by a single substitution. This may explain, in part, the high 
degree of conservation estimated in 5' UTRs. 



Some uAUGs are conserved and may play regulatory roles 

Although uAUGs down-regulate gene expression and, in general, 
are under strong purifying selection, nearly 21% of the genes in 
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Figure 5. mRNA and protein expression of wild type and uAUG mutant. (A) A box-plot of protein expression between genes without uAUG and genes 
with one uAUG. (B) Comparison of the reporter protein expression driven by regulatory sequence between wild type and uAUG mutant in 21 genes. (C) 
mRNA expression quantified by qRT-PCR. (D) Protein expression quantified by YFP reporter gene. In Cand D, the expression of wild type is normalized to 1, 
and the expression of the uAUG mutant is compared with its correspondent wild type. (Dark blue) Wild type in BY4 74 2 strain; (light blue) uAUG mutant in 
BY4742 strain; (yellow) wild type in upfl-A strain; (brown) uAUG mutant in upfl-A strain. 



yeast have maintained this trinucleotide in their 5' UTRs, sug- 
gesting that this trinucleotide may play a regulatory role. To 
examine whether existing uAUGs were evolutionarily con- 
served, we analyzed the sequence conservation of the 5' UTRs of 
uAUG-containing genes using alignments of four yeast genomes 
(Saccharomyces cerevisiae, Saccharomyces paradoxus, Saccharomyces 
mikatae, and Saccharomyces kudriavzevii) (http://genome.ucsc.edu). 
We used the phyloP score (Siepel and Haussler 2006) to evaluate 
the sequence conservation of each uAUG. Our data showed that 
in 5' UTRs, AUG was indeed the most conserved among 64 tri- 
nucleotides, with an average score of 0.72 versus 0.48 for all 64 tri- 
nucleotides in this region (Wilcoxon rank sum test, P < 7.2 X 10~ 12 ). 

We next asked if existing uAUGs in S. cerevisiae 5 ' UTRs were 
more likely to be conserved in S. paradoxus than other trinucleotide 
sequences. Of the 4333 genes analyzed, we found 990 and 1045 
genes with uAUGs in S. cerevisiae and S. paradoxus, respectively; 866 
of these overlapped. Among all 64 trinucleotides, uAUG was the 
most highly conserved triplet with respect to the overlapping gene 
set (p < 0.05) (Supplemental Fig. S7). Taken together, our results 



suggest that while there is strong selection against the creation of 
new uAUGs, existing uAUGs play important regulatory roles and 
thus are conserved across yeast species. 

Systematic introduction of uAUG mutations in yeast genes 
suggests their widespread regulatory role in gene expression 

Since uAUGs are strongly conserved in many genes and appear to 
have a large effect on gene expression, we hypothesized that the 
presence of a uAUG in the 5' UTR of a gene may be a general 
mechanism that a cell uses to tune down gene expression. How- 
ever, it is possible that the effect of uAUG is dependent on certain 
sequence contexts. To examine if the creation of a uAUG at a ran- 
dom location universally impacts gene expression, we selected 21 
genes whose 5' UTRs do not contain an uAUG but have the po- 
tential to create a frameshift uAUG with a single-nucleotide sub- 
stitution. To maintain consistency with our observation in the 
GAL1 mutation system, we also required that each uAUG site 
initiate a frameshift uORR For each gene, we constructed a yeast 
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strain with a YFP reporter gene cassette integrated at the TRP1 locus. 
The reporter gene is under the control of either a wild type or a 
mutant construct that contains a single base substitution creating 
a uAUG. We compared YFP reporter levels between both wild-type 
and mutant strains using six independent yeast transformants for 
each gene. In all cases, the average expression levels of the mutant 
were lower than in wild type. In 81% (1 7/21) of cases, the expression 
reduction was statistically significant (Student's t-test, P < 0.01), and 
in 14 cases, the expression level of the mutant was reduced by at 
least twofold. These results suggest that the presence of a uAUG in 
the 5' UTR of a gene is a general mechanism by which the cell can 
modulate gene expression. The distribution of the distance for the 
first uAUG sites relative to the canonical ATG start site is shown in 
Supplemental Figure S8. All of the mutation sequences in this study 
can be found in Supplemental Material VII. 

uAUGs regulate gene expression through both transcriptional 
and translational control 

Previous studies suggest that uORFs can regulate gene expression 
through different mechanisms (Hood et al. 2009). To investigate 
whether uAUG regulates gene expression transcriptionally or trans- 
lationally, we directly measured the reporter gene's mRNA levels for 
five of the 21 genes described above. We chose genes that displayed 
the most dramatic changes in the expression of the reporter gene 
upon the introduction of a uAUG. 

All five genes showed a statistically significant reduction 
of mRNA levels between mutant and wild type (Student's t-test, 
P < 0.01) as measured by qRT-PCR. The average mRNA reduction 
was 4.5-fold (range: 2.2-fold to 10.3-fold) (Fig. 5C). Protein levels 
were even more dramatically reduced than the corresponding 
mRNA levels for each gene (Student's t-test, P < 0.01), with an av- 
erage reduction of 15.7-fold (range: 2.8-fold to 43.7-fold) (Fig. 5D). 
The observed impact of uAUGs on mRNA and protein levels was 
consistent with our genome-wide analysis of transcriptomic and 
proteomic data and suggests that uAUGs govern gene expression at 
both the transcriptional and translational level. 

mRNA molecules containing frameshift uAUGs are degraded 
by the NMD pathway 

To test if the NMD pathway degrades uAUG-containing mRNAs, 
we introduced the uAUG mutants described above into a yeast 
strain, upfl-A, that is deficient for NMD. As a control, we also made 
NMD-deficient reporter strains with the wild-type 5' UTRs. In the 
NMD-suppressive background, all five uAUG mutants showed 
mRNA reduction by qRT-PCR quantification (Student's t-test, 
p < 0.05 for four genes, andp < 0.1 for one gene), with an average 
reduction of 1.5-fold (range: 1.2-fold to 1.9-fold) (Fig. 5C). The 
degree of the mRNA reduction in the yeast upfl-A strain is much 
less than in the control BY4742 strain, and four genes showed 
statistically significant differences in mRNA reduction between 
the two backgrounds (Student's t-test, P < 0.05). On average, we 
observed a 36% increase in normalized mRNA levels of uAUG- 
containing genes in the upfl-A strain relative to the wild- type 
strain. This result demonstrates that the NMD pathway plays an 
important role in degrading transcripts with frameshift uAUGs. 

Discussion 

Using the well-studied yeast GAL1-10 regulatory region as a model 
system (Supplemental Material VI), our study comprehensively 



examined the relationship between sequence conservation and 
function at a single-nucleotide resolution. We found that the ma- 
jority of mutations (81%) that cause gene expression changes are 
located in conserved regions or known regulatory regions. This 
supports the canonical view that regulatory elements can be iden- 
tified by sequence conservation (Birney et al. 2007) and stands in 
contrast to a recent study in eutherian mammals suggesting that 
transcription factor binding sites can only rarely be identified by 
conserved bases in a multiple alignment, due in large part to binding 
site turnover (Schmidt et al. 2010). This discordance maybe explained 
by the fact that, while yeast is an excellent model system for the 
study of gene regulation (Cliften et al. 2003; Kellis et al. 2003; Beer 
and Tavazoie 2004), its genome has smaller intergenic regions than 
are found in mammals, and thus binding site turnover occurs less 
frequently. Thirty-nine percent of the bases that have no effect 
on gene expression upon galactose induction are conserved (see 
Supplemental Table S2). Thus sequence conservation appears to 
have modest specificity (61%) as a predictor of rSNPs. 

Although the GAL1 -10 regulatory region has been extensively 
mutagenized (West et al. 1984), our approach identified six mu- 
tations within previously uncharacterized regulatory sites, located 
outside of known transcription factor binding sites and the 5 'UTR 
region. Two out of the six mutations with unknown function are 
conserved among four yeast species. By searching with PWMs 
(Matys et al. 2003; Zhu et al. 2009) using cutoffs of reduced strin- 
gency, we could not find any convincing evidence that suggests 
either the creation or disruption of a TFBS, but three of these 
(-84, -101, and -547) are located near the region protected by 
GAL80 in footprinting experiments (Lohr et al. 1987). Also, al- 
though we could not find direct evidence that the novel sites 
themselves are involved in the creation or disruption of sequence 
motifs that alter nucleosome positioning signals (Kaplan et al. 
2009), position -387 was located in a DNase I hypersensitive re- 
gion and was close to a putative RSC/nucleosome complex binding 
site (Reagan and Majors 1998). 

The biggest changes in gene expression were not due to muta- 
tions in TFBSs, but instead due to the creation of uAUGs. Although 
the effects of existing uAUGs present in the yeast genome have been 
noted previously (Vilela and McCarthy 2003; Hinnebusch 2005; 
Hood et al. 2009), it was not clear whether randomly created uAUGs 
would have consistently strong effects on gene expression. We 
showed that the efficacy of uAUGs is largely independent of sequence 
context. Because 14% of the bases in yeast 5' UTRs can be converted 
to a uAUG by a single mutation, this represents a potent evolutionary 
mechanism for modulating the expression of virtually any gene. 

We found evidence for strong purifying selection against 
uAUGs in yeast 5' UTRs, an observation that is consistent with 
previous studies (Churbanov et al. 2005; Iacono et al. 2005). In 
agreement with Churbanov et al., we also found that, for the 20% 
of genes in yeast that do contain uAUGs in their 5' UTRs, these 
trinucleotides tend to be conserved, suggesting that while in general 
uAUGs are deleterious, for a subset of yeast genes, they play an 
important regulatory role that confers a selective advantage to the 
organism. We further showed that selection against uAUGs may, in 
fact, partially explain the observation that 5' UTRs are highly 
conserved, despite the fact that few regulatory elements have been 
found in these sequences. This observation suggests that some 
nucleotides may be under evolutionary constraint, not because 
they are functional so that a substitution would destroy that func- 
tion, but instead, because certain substitutions may create an ele- 
ment with a new function that could be deleterious to the organism. 
We propose that selection against neomorphic mutations may ex- 
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plain a portion of the surprisingly high degree of conservation ob- 
served in regulatory regions. 

Upstream AUGs can attenuate gene expression by at least two 
mechanisms: (1) translation from the uAUG causes a reduction in 
the translation rate at the canonical AUG (Hinnebusch 2005; 
Medenbach et al. 2011); and (2) the premature termination of 
polypeptides initiated at the uAUG can stimulate the degradation 
of mRNA transcripts via NMD (He et al. 2003). We measured the 
contribution of each mechanism in five yeast genes and found that 
both play significant roles in determining the final protein levels. 
Interestingly NMD accounted for only part of the reduction in 
mRNA levels because we still observed a significant reduction in 
mRNA levels in NMD deficient yeast strains. This may suggest an- 
other, as yet unknown, mechanism by which these transcripts are 
reduced. Alternatively, it may be that the NMD pathway is not 
completely abolished in our upflA deficient strain. 

Because uAUGs have large effects on gene expression and 
occur in the 5' UTRs of many yeast genes, they collectively have 
a substantial impact on protein expression in yeast. A similar pu- 
rifying selection against uAUGs has also been observed in mam- 
mals (Iacono et al. 2005), indicating their functional roles would 
extend to multicellular eukaryotes (Medenbach et al. 2011). SNPs 
that create or destroy a uAUG may represent an important source 
of functional noncoding variation, and disease-associated SNPs in 
these regions would be strong candidates for functional studies. 

Methods 

Construction of plasmids 

Plasmids expressing CFP (pYlOTC) or YFP (pYlOTY) were con- 
structed from a yeast integration vector pRS306 with a selectable 
URA3 marker (Sikorski and Hieter 1989). The following regions 
were inserted into the pRS306 multiple cloning sites: a 447-bp TRP1 
homologous fragment for chromosomal integration, a 468-bp 
ADH1 site, a 630-bp GAL1-10 regulatory region, a yECFP or a yVYFP 
fluorescent protein-coding site (from pJRL2 plasmid derivatives, 
kindly provided by Dr. E. O'Shea) (Raser and O'Shea 2004), and 
a 291-bp 3' UTR from the ACT1 site. 

Mutation library construction 

Random mutagenesis on the GAL1-10 regulatory region was per- 
formed through error-prone PCR, using the GeneMorph II random 
mutagenesis kit (Stratagene). Plasmid pYlOTY was used as the 
template for PCR amplification, and a PCR primer pair was designed 
to flank the GAL1-10 region of the pYlOTY plasmid. The linear PCR 
mutagenesis was preceded by a two-step process: First, a typical er- 
ror-prone PCR reaction was set up as described in the manual 
(Stratagene) with two modifications: (1) only the forward PCR 
primer was added to the reaction; and (2) PCR cycles were ex- 
tended to 50 rounds. Second, the reverse PCR primer was added, 
and the reaction was completed by one more cycle of PCR, fol- 
lowed by a 10-min PCR extension. 

Site-directed mutagenesis was performed as described by 
Dieffenbach and Gabriela (2003). PCR products were amplified by 
Jumpstart Taq DNA Polymerase (Sigma-Aldrich) using plasmid 
pYlOTY as a template. A total of 140 PCR primer pairs, each of 
them containing one mismatched nucleotide from the GAL1-10 
wild-type sequence, were designed. 

Mutagenic PCR products replaced the wild-type GAL1 se- 
quence in vector pYlOTY. E. coli GC10 chemical competent cells 
(Gene Choice) were used for transformation. Clones were picked 
and submitted for sequencing. 



DNA sequencing 

Forward PCR primer 5 '-CCTAAAGTAGTGACTAAGGTTGGC-3 ' 
and reverse PCR primer 5 ' -GGTGTGTATTTTATGTCCTCAGA-3 ' 
were designed to flank the GAL1 -10 regulatory region of the E. coli 
construct. To sequence a construct, we first amplified the plasmid 
DNA using a TempliPhi DNA amplification kit (GE Healthcare). 
We then sequenced each construct four times. Mutagenic con- 
structs were sequenced at the Washington University Genome 
Sequencing Center. 

All sequencing reaction was prepared by Big Dye mix v3.1 
(Applied Biosystems). Sequences from the same construct were 
analyzed, assembled, and viewed using phred, phrap (Ewing and 
Green 1998), and consed (Gordon et al. 1998). 

A single-nucleotide mutation was assigned using a custom- 
ized Perl script based on the following formula: 



P(X/|D) = 



P(D\Xi)P(Xi) 
Y* =1 P(D\Xi)P(Xi) 



where D are the observed sequences; Xi are the mutated bases i = {A, 
C, G, T}; and P(Xz) is the mutation rate over the entire library. 

Strains 

A yeast haploid strain BY4742 (MATa his3A 1 leu2A0 lys2A0 ura3A0 
ARG) and a derivative of a haploid strain BY4741 with a YHR018C 
gene deletion (Mata his3Al leu2A0 ura3A0 metlSAO arg4A0 LYS) 
have been described previously (Brachmann et al. 1998) and were 
kindly provided by Dr. M. Johnston (University of Colorado, 
Denver). A wild- type GAL1 construct expressing CFP and muta- 
genic GAL1 constructs expressing YFP were integrated at the TRP1 
locus of the yeast strain BY4742 and BY4741 derivative, respec- 
tively. Diploids were obtained by mating and further selection on 
synthetic lysine and arginine double-dropout media. 

Yeast transformation 

Plasmids pYlOTC and pYlOTY that contained either wild-type or 
mutagenic GAL1 sites were amplified using a TempliPhi DNA am- 
plification kit (GE Healthcare). The rolling cycle products were 
digested by the AscI restriction enzyme (New England Biolabs), 
followed by the yeast transformation (Gietz and Woods 2002). 

Each mutagenic plasmid was transformed individually, and 
six clones from each transformation event were selected for further 
analysis. To confirm the integration event of an E. coli construct 
into a yeast strain, we sequenced at least one yeast colony for each 
yeast transformation event. The PCR products were then treated 
with ExoSAP-IT (USB) for clean-up and submitted for sequencing. 

Measurement of reporter gene expression 

Each yeast clone was grown in 2-mL 96-well plates overnight at 
30°C in 600 \xL of YPD media. Yeast cultures (5 \xL) were transferred 
to 600 (jlL of synthetic uracil dropout media with 2% raffinose and 
were grown to an OD 600 of <0.5. To induce the GAL1 -10 regulatory 
region, cultures (30 (jlL) were transferred to 600 (jlL of synthetic 
uracil dropout media containing 2% raffinose and 2% galactose. To 
induce the 21 randomly selected genes, yeast strains were cultured 
in 2% glucose media. 

Fluorescence measurement was performed on a Beckman 
Coulter Cell Lab Quanta SC after 4 h and 8 h of induction. For each 
well, the fluorescence intensities of both CFP and YFP were mea- 
sured simultaneously for all 15,000 cells. The expression level in 
each well was calculated by averaging the YFP-versus-CFP ratio for 
15,000 cells. Each plate contained eight control strains with both 
CFP and YFP fluorescence proteins driven by a wild-type GAL1 
construct. To control the plate variation, the expression value of 
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each well was then normalized to the average of the control sam- 
ples on the same plate. Induction and measurement were repli- 
cated for each plate. 

Comparison of genes with uAUGs between S. cerevisiae 
and S. paradoxus 

Sequence alignment between S. cerevisiae and S. paradoxus was 
downloaded from http://www.genetics.wustl.edu/jflab/data4.html 
(genome alignments of three strains combined with S. paradoxus). 
We estimated S. paradoxus 5' UTR sequence based on the RNA-seq 
report for S. cerevisiae (Nagalakshmi et al. 2008). To examine 
whether the overlapped genes with uAUGs is significant between two 
species, we performed hypergeometric tests. We applied the same test 
to all 64 trinucleotides and ranked the P-value of these tests. 

RNA isolation and quantitative real-time PCR assays 

RNA samples were isolated using a standard TRIzol method 
(Invitrogen). Purified RNA was reversely transcribed into cDNA 
using a Superscript III kit (Invitrogen) with random hexamers 
(IDT). Real-time quantitative RT-PCR analyses were performed on 
a Bio-RAD CFX96 real-time PCR detection system (Bio-Rad) using 
an ABsolute Blue QPCR SYBR Green kit (Thermo Scientific). 

Data access 

The sequence data from this study have been submitted to 
GenBank (http://www.ncbi.nlm.nih.gov/genbank) under acces- 
sion numbers JQ6 7621 6-JQ6 7 68 18; accession numbers are given 
in Supplemental Table S4. 
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